Python Cookbook

Notes on useful functions or operations in Python.

Basic Calculations

# Exponentiation
4 ** 2

# Modulo
18 % 7

# String concatenation
'ab' + 'cd' # returns 'abcd'

Basic Manipulations

Lists

Characteristics

  • Names a collection of values: d = [a, b, c]
  • Can contain any type, and different types can be mixed in one list
  • List of lists: d2 = [[a,1],[b,2],[c,3]]
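
The points above in a quick session (the values here are made up for illustration):

```python
d = ['a', 1, True, 3.5]  # one list can mix strings, ints, booleans, floats
d2 = [['a', 1], ['b', 2], ['c', 3]]  # a list of lists

# index the outer list first, then the inner one
print(d2[1])     # ['b', 2]
print(d2[1][0])  # 'b'
```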

Subsetting lists
Zero-based indexing: counting starts from 0, and slices follow [start (inclusive):end (exclusive)]

  • x[0]: returns the first element
  • x[-1]: returns the last element
  • x[-2]: returns the second-to-last element
  • x[3:5]: returns the fourth and fifth elements. The element at index 5 is not selected, so the slice behaves like [start, end) mathematically
  • x[:4]: returns the first four elements (indices 0 through 3)
  • x[5:]: returns everything from the sixth element to the last
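
The rules above can be verified in a quick session (the list here is just an illustration):

```python
x = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

print(x[0])    # 'a'  (first element)
print(x[-1])   # 'g'  (last element)
print(x[3:5])  # ['d', 'e']  (indices 3 and 4; index 5 is excluded)
print(x[:4])   # ['a', 'b', 'c', 'd']  (first four elements)
print(x[5:])   # ['f', 'g']  (from the sixth element on)
```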

List Manipulation

  • Replace the indexed slice of the list with the desired values

    x[0:2] = [a, b]
  • Adding and removing elements

    # adding
    x = x + [a, b] #append a and b to the end of the list

    # removing
    del(x[2])

*Important Note: when you create a new list, what actually happens is that the list is stored in the computer's memory, and the "address" of the list is stored in the variable. This means that the variable does not contain the list elements themselves, but rather a reference to them. This difference is especially important when you try to copy the list:*

x = ['a', 'b', 'c']
y = x
y[1] = 'z'

Now if you print y, you will see the following output: ['a', 'z', 'c'], while interestingly, the element in x is also changed into ['a', 'z', 'c'].

That is because when you copy x to y with an equal sign, you copy the reference, not the list elements themselves. Therefore, when you update an element of the list stored in memory, both x and y, whose references point to that list, will show the changed outcome.

If you want to create a list y with a new list of elements but same values as x, you should use y = list(x) or y = x[:] to select all the elements explicitly. Now when you update the elements in y, x will not change accordingly.
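
A minimal demonstration of the difference between copying the reference and copying the elements:

```python
x = ['a', 'b', 'c']
y = x        # y points to the same list object as x
y[1] = 'z'
print(x)     # ['a', 'z', 'c'] — x changed too

x = ['a', 'b', 'c']
y = list(x)  # y is a new list with the same values (y = x[:] works too)
y[1] = 'z'
print(x)     # ['a', 'b', 'c'] — x is unaffected
```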

Data Frames

Summing column-wise and row-wise

# column-wise
temp.sum()

# row-wise
temp.sum(axis = 'columns')

Exploring Dataset

Initial Examination

# Import the pandas library as pd
import pandas as pd

# Reading Files
data = pd.read_csv('ransom.csv')

# Examining the whole dataset
data.info() # shows number of rows and columns, each column's data type, and memory usage
data.describe() # returns summary statistics of each column
data.head() # pass a number n to show the top n rows
data.tail() # pass a number n to show the bottom n rows
data.shape # number of rows and columns
data.dtypes # all columns' data types
data.columns #examine the columns

# Examining one variable
type(variable_a) #returns the data type of the variable
data.col_a.value_counts() #count unique values
data.col_a.value_counts(normalize = True) #output proportions of unique values

# Tally table
pd.crosstab(ri.driver_race, ri.driver_gender) #shows a frequency table to tally how many times each combination of values occurs

Merging Data Frames

apple_high = pd.merge(
    left = apple,
    right = high,
    left_on = 'date',
    right_on = 'DATE',
    how = 'left') # type of join
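
A toy example of the left join above (the frames and values here are made up for illustration):

```python
import pandas as pd

apple = pd.DataFrame({'date': ['2024-01-02', '2024-01-03'],
                      'price': [185.6, 184.3]})
high = pd.DataFrame({'DATE': ['2024-01-02'],
                     'high': [186.4]})

# keep every row of the left frame; unmatched rows get NaN in the new columns
apple_high = pd.merge(left = apple, right = high,
                      left_on = 'date', right_on = 'DATE',
                      how = 'left')
```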

Converting Data Types

a = float(b) # convert to float
str() # convert to string
int() # convert to integer
bool() # convert to boolean

When a column has only a few distinct values, converting it to a categorical data type is more memory-efficient. It also allows you to specify a logical order for the categories.

ri.stop_length.unique()
# Result: array(['short', 'medium', 'long'], dtype=object)

# Specifying order
cats = ['short', 'medium', 'long'] # a<b<c

# Convert the type using an ordered categorical dtype
ri['stop_length'] = ri.stop_length.astype(pd.CategoricalDtype(categories = cats, ordered = True))

# Now you can use comparison operators on this column
ri[ri.stop_length > 'short'].shape # filter only medium and long
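
A self-contained sketch of the same idea, with a made-up `ri` frame:

```python
import pandas as pd

# hypothetical data for illustration
ri = pd.DataFrame({'stop_length': ['short', 'long', 'medium', 'short']})

cats = pd.CategoricalDtype(categories = ['short', 'medium', 'long'],
                           ordered = True)
ri['stop_length'] = ri.stop_length.astype(cats)

# ordered categories support comparison operators
longer = ri[ri.stop_length > 'short']  # keeps only 'medium' and 'long' rows
```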

Data Cleaning

Dealing with NAs

# Count the number of missing values in each column
data.isnull().sum()

# Drop unuseful columns
data.drop(['column_a', 'column_b'], axis = 'columns', inplace = True)

# Drop rows that have NAs in critical columns
data.dropna(subset = ['column_a', 'column_b'], inplace = True)

Dealing with Data Types

  • Changing data types

    # examine the data types of all columns
    data.dtypes

    # change the data type of one column
    data['col_a'] = data.col_a.astype('bool')
    ## datetime, categorical
  • Date time index

    # Concatenate date and time columns
    combined = data.date.str.cat(data.time, sep = ' ')

    # Convert 'combined' to datetime format
    data['datetime'] = pd.to_datetime(combined)

    # Setting datetime as index
    data.set_index('datetime', inplace = True)

    # Examine the index
    data.index

    # Reset the index back to a column
    data.reset_index(inplace = True)

Data Manipulation

Slicing Columns and Rows

# select column
items = data.item
items = data['item']

# select rows using logicals
expensive_items = data[data.price > 20]
expensive_items = data[data['price'] > 20] # bracket notation (required when the column name contains spaces)

Aggregation

  • Using Groupby

    #groupby
    a = data.num_people.mean()
    a_by_month = data.groupby(data.index.month).num_people.mean()

    Or use multiple variables as the groupby criteria to create a multi-indexed series:

    search_rate = ri.groupby(['violation', 'driver_gender']).search_conducted.mean()
    # the resulting multi-indexed series is very similar to a data frame.

    # we can use the loc accessor to slice it
    search_rate.loc['Equipment', 'M']

    # we can also convert it to a data frame by unstacking it
    search_rate.unstack()
  • Using Pivot Table
    To save the trouble of groupby and then unstacking, we can directly use pivot_table to achieve the same result.

    # pivot_table uses mean as the default aggregation method
    ri.pivot_table(
        index = 'violation',
        columns = 'driver_gender',
        values = 'search_conducted')
  • Using Resampling based on date time
    The resulting groups will be the last day of each month, rather than just 1, 2, and 3 like the groupby ones.

    a = data.num_people.resample('M').mean()
    #M indicates month, A indicates year(annual)

Mapping

Dictionary maps the values you have to the values you want.

# mapping up to True, and down to False
mapping = {'up':True, 'down':False} # before:after
apple['is_up'] = apple.change.map(mapping)

# using mean() to calculate the percentage of up
apple.is_up.mean()


Basic Plots

from matplotlib import pyplot as plt
# Line Plot
plt.plot(x_values, y_values) # add another plt.plot(x2, y2) below it to display multiple lines together

# if you already have a series with an index, you can call plot directly without specifying x and y:
plt.plot(indexed_series)

# Scatter plot
plt.scatter(
    x_value,
    y_value,
    alpha = 0.1 # change marker transparency
)

# Bar chart
plt.bar(
    x_value,
    y_value,
    yerr = df.sd_column
    # pass a column of the data holding standard deviations;
    # this draws an error bar on top of each bar
)
plt.ylabel('y_label')

# Horizontal bar chart
plt.barh(x_value, y_value)

# Stacked bar chart
plt.bar(x_value, y_value1, label = '1')
plt.bar(x_value, y_value2, bottom = y_value1, label = '2')
## this stacks y_value2 on top of y_value1
plt.legend()

# Histogram
plt.hist(
    value,
    bins = 40, # number of bins
    range = (50, 100), # set the min and max range
    density = True # normalization
    ## Normalization reduces the height of each bar by a constant factor
    ## so that the areas of the bars sum to one. We use it to compare
    ## two datasets with very different sizes.
)

# remember to show the plot in the end
plt.show()

Styling Graphs

Change overall styles:

plt.style.use("fivethirtyeight") #change a set of styles, including the background, legends ...

  • Overall Styles: fivethirtyeight, ggplot, seaborn, default, etc.

Change some specific styles:

plt.plot(
    x_values,
    y_values,
    color = "tomato", # color names can be found under "web colors" on Wikipedia
    linewidth = 1, # helps to emphasize a certain line
    linestyle = "--", # line type
    marker = "x" # marker for each data point
)

  • Linestyle:

    1. "--": dashed line
    2. "-.": dash-dot line
    3. ":": dotted line
  • Marker:

    1. "x": cross
    2. "s": square
    3. "o": circle
    4. "d": thin diamond
    5. "*": star
    6. "h": hexagon

Adding text to plots

Put these calls before plt.show().

# title
plt.xlabel("x axis title")
plt.ylabel("y axis title")
plt.title("plot title", fontsize = 20, color = 'green') # change text fontsize, color

# legend
plt.plot(x_values, y_values, label = "A") # the label keyword argument sets the text shown in the legend
plt.plot(x_values, y_values, label = "B")
plt.plot(x_values, y_values, label = "C")
plt.legend() # tells matplotlib to show legends

# text to a certain point
plt.text(xcoord, ycoord, "Text Message")

Making Comparisons

Line graph

Line graphs usually deal with time-related comparisons: we want to compare the trends of two variables across time.

We can compare two variables by applying the same aggregation to both of them and plotting the results together.

# calculate the aggregation for both variables
monthly_price = apple.price.resample('M').mean()
monthly_volume = apple.volume.resample('M').mean()

# concat the two variables side by side
monthly = pd.concat([monthly_price, monthly_volume], axis = 'columns')

# plot two variables
## specifying to put the two variables into two subplots since two variables often have different scales
monthly.plot(subplots = True)
plt.show()

Bar chart

If we only have one categorical variable, we can simply plot it with a bar chart and sort the order.

# outputs a series with the categorical values as the index and the calculated numbers as the values (sorted in alphabetical order by index)
search_rate = ri.groupby('violation').search_conducted.mean()

# Sort the series in descending order by its value before plotting
search_rate.sort_values().plot(kind = 'bar')

# make the labels easier to read by rotating the bars (horizontal bar chart)
search_rate.sort_values().plot(kind = 'barh')

If we have two categorical variables, we may want to compare the counts of different combinations of categories.

table.plot(kind = 'bar')

#stacked bar chart
table.plot(kind = 'bar', stacked = True)
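
The `table` used above can be built with `pd.crosstab`; here is a sketch with made-up data and column names:

```python
import pandas as pd

# hypothetical data for illustration
df = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding'],
    'driver_gender': ['M', 'F', 'M', 'M'],
})

# one row per violation, one column per gender, counts in the cells
table = pd.crosstab(df.violation, df.driver_gender)

# table.plot(kind = 'bar', stacked = True) would then draw the stacked chart
```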

Box Plot

weather[['TMIN', 'TAVG', 'TMAX']].plot(kind = 'box')

Strings

Regular Expression

  1. "." matches any single character
  2. "*" matches zero or more repetitions of the preceding element
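
A quick check of these patterns with the standard `re` module:

```python
import re

# '.' matches exactly one (any) character
assert re.fullmatch(r'a.c', 'abc')

# '*' matches zero or more repetitions of the preceding element
assert re.fullmatch(r'ab*c', 'ac')       # zero 'b's
assert re.fullmatch(r'ab*c', 'abbbc')    # several 'b's
assert not re.fullmatch(r'ab*c', 'abd')  # no match returns None
```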